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Chapter 1 

Introduction to Linear Algorithms 
Under Non-Stationarity 

Non-stationarity of a stochastic process is defined loosely as variability of probability distribution over time. 
Conversely, stationarity of a stochastic process corresponds to constancy of distribution. In machine learning, 
typical tasks include regression, classification or system identification. For regression and classification, no 
guarantee on generalization, from a training set to a test set, may be made under the assumption that 
the underlying process, yielding the training and test sets, is non-stationary. Thus quantification of non- 
stationarity and algorithms for choosing features which are stationary are indispensable for these tasks. 
On the other hand, in system identification, a stationary or maximally non-stationary subsystem often 
carries important significance in terms of the primitives of the domain under consideration: for example, in 
neuroscience, identification of the subsystem of the dynamics over synaptic weights with a neural network 
with n weights consisting of the subset of m < n weights whose distribution is maximally non-stationary 
under learning is of vital interest to analysis of the neural substrate underlying the learning process. 

1.1 Survey on Projection Algorithms under the Non-Stationarity As- 
sumption 

The classical literature on machine learning and statistical learning theory assumes that the samples used for 
training the parameters of, for instance, classifiers, are drawn from a single probability distribution, rather 
than a, possibly non-stationary, process [55]. That is to say, the time series from which the data are taken 
is stationary over time. Thus, approaches to learning which relax this stationarity assumption have been 
sparse up until the present time within the Machine Learning literature. In particular, the first algorithm, 
Stationary Subspace Analysis (SSA) for obtaining a linear stationary projection of a data set was published 
only recently in 2009 [54 . Since 2009 algorithms have published which refine SSA computationally [22 under 
assumptions, develop a maximum likelihood approach to SSA [27 and derive an algebraic algorithm for its 
solution [28] . In addition to SSA, linear algorithms have been published for applications, for example, sCSP 
for Brain Computer Interfacing [56 . However, non-stationarity in Machine Learning remains largely an open 
problem: for instance, no general technique for classification under non-stationarity has been proposed which 
is non-adaptive. (Adaptation has been studied in some detail, for instance, for an adaptive neural network 
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for regression or classification, see |46| : otherwise only the covariate shift problem has been studied in detail 
by, for instance, [50].) Methods which seek to perform robust classification under non-stationarity remain to 
be fully investigated. 

1.2 Overview and Layout of this Thesis 

The present thesis's contribution is twofold: firstly in Chapter [2] we present a method based on Stationary 
Subspace Analysis (SSA) [53] which addresses the system identification task described above, namely iden- 
tifying a maximally non-stationary subsystem. We subsequently apply this method to a specific machine 
learning task, namely Change Point Detection. In particular, we show that using the method based on SSA 
as a prior feature extraction step to Change Point Detection, boosts the performance of three representative 
Change Point Detection algorithms on synthetic data and data adapted from real world recordings. Note 
that this chapter (Chapter [2]) consists in part of joint work of the present author and the authors of the 
following submitted paper: [10], of which the Chapter is an adapted version. Secondly, in Chapter [3] we 
address the two-fold classification problem under non-stationarity and propose a method for adapting a com- 
monly used method for two-fold classification, namely Linear Discriminant Analysis (LDA) for this setting: 
we call the resulting algorithm stationary Linear Discriminant Analysis, or sLDA for short. We investigate 
the properties and performance of the algorithm on simulated data and on data recorded for Brain Computer 
Interface experiments. Finally having tested the algorithm we perform a rigorous investigation of the results 
obtained by means of statistical testing and comparison with base-line methods. Please note, in addition, 
that Chapter [3] includes joint work with Wojciech Samck. 



Chapter 2 



Maximizing Non-Stationarity with 
Applications to Change Point 
Detection 

2.1 Introduction 

Change Point Detection is a task that appears in a broad range of applications such as biomedical signal 
processing (T9 l l39 l I3T]. speech recognition [11145], industrial process monitoring [4, 38 , fault state Detection 
fTT] and econometrics [T3]. The goal of Change Point Detection is to find the time points at which a time 
series changes from one macroscopic state to another. As a result, the time series is decomposed into segments 
|3] of similar behavior. Change point detection is based on finding changes in the properties of the data, 
such as in the moments (mean, variance, kurtosis) 0], in the spectral properties [3 , temporal structure |3U] 
or changes w.r.t. to certain patterns |TTJ. The choice of any of these aspects depends on the particular 
application domain and on the statistical type of the changes that one aims to detect. 



Observed time series Underlying sources Comparison of source signal power 




Time Time Time 



Figure 2.1: Informative vs. uninformative directions for Change Point Detection. The left panel shows the 
observed bivariate time series where no pronounced changes are visible. The middle panel shows the two 
underlying sources, where one of them exhibits clearly visible changes. In the right panel, we see that the 
stationary sources has much higher signal power than the informative non-stationary sources and thus masks 
the presence of change-points in the observed data. 

For a large family of general segmentation algorithms, state changes are detected by comparing the 
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empirical distributions between windows of the time series [25 26 30 . Estimating and comparing probability 
densities is a difficult statistical problem, particularly in high dimensions. Often, however, many directions 
in a high dimensional signal space are uninformative for Change Point Detection: in many cases there 
exists a subspace in which the distribution of the data remains constant over time (i.e. is stationary). 
This subspace is irrelevant for Change Point Detection but increases the overall dimensionality. Moreover, 
stationary components with high signal power can make change points invisible to the observer and also to 
detection algorithms. For example, there are no change points visible in the time series depicted in the left 
panel of Figure |2.1[ even though there exists one direction in the two-dimensional signal space which clearly 
shows two change-points, as it can be seen in the middle panel. However, the non-stationary contribution is 
not visible in the observed signal because of its relatively low power (right panel). In this example, we also 
observe that it does not suffice to select channels individually, as neither of them appears informative. In 
fact, in many application domains such as biomedical engineering |57l I39| or geophysical data analysis |35| . 
it is most plausible that the data is generated as a mixture of underlying sources which we cannot measure 
directly. 

In this chapter we show how to extract useful features for Change Point Detection by finding the most 
non-stationary directions using a variant of Stationary Subspace Analysis [54 . Even though there exists 
a wide range of feature extraction methods for classification and regression [20], to date, no specialized 
procedure for feature extraction or for general signal processing [23] has been proposed for Change Point 
Detection. In controlled simulations on synthetic data, we show that for three representative Change Point 
Detection algorithms the accuracy is significantly increased by a prior feature extraction step, in particular 
if the data is high dimensional. This effect is consistent over various numbers of dimensions and strengths 
of change-points. In an application to Fault Monitoring, where the ground truth is available, we show that 
the proposed feature extraction improves the performance and leads to a dimensionality reduction where 
the desired state changes are clearly visible. Moreover, we also show that we can determine the correct 
dimensionality of the informative subspace. 

The remainder of this chapter is organized is follows. In the next Section |2.2| we introduce our feature 
extraction method that is based on an extension of Stationary Subspace Analysis. Section [2~3] contains the 
results of our simulations and in Section [2~4| we present the application to Fault Monitoring. Our conclusions 
are outlined in Section 12.51 

2.2 Feature Extraction for Change-Point Detection 

Feature extraction from raw high-dimensional data has been shown to be useful not only for improving 
the performance of subsequent learning algorithms on the derived features [2U] but also for understand- 
ing high-dimensional complex physical systems. In many application areas such as Computer Vision |16j . 
Bioinformatics |471 [55] and Text Classification [33] , defining useful features is in fact the main step towards 
successful machine learning. General feature extraction methods for classification and regression tasks are 
based on maximizing the mutual information between features and target |51j , explaining a given percentage 
of the variance in the dataset (40] , choosing features which maximize the margin between classes [34] or 
selecting informative subsets of variables through enumerative search (wrapper methods) [2U]. However, for 
Change-Point Detection no dedicated feature extraction has been proposed [I]. Unlike in classical supervised 
feature selection, where a target variable allows us to measure the informativeness of a feature, for Change- 
Point Detection we cannot tell whether a feature elicits the changes that we aim to detect since there is 
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usually no ground truth available (the problem is unsupervised). Even so, feature extraction is feasible fol- 
lowing the principle that a useful feature should exhibit significant distributional changes over time. Reducing 
the dimensionality in a pre-processing step should be particularly beneficial for the Change-Point Detection 
task: most algorithms either explicitly or implicitly make approximations to probability densities [301 I26| or 
directly compute a divergence measure based on summary statistics, such as the mean and covariance [2] 
between segments of the time series — both are hard problems whose sample complexities grow exponentially 
with the number of dimensions. 



As we have seen in the example presented in Figure 2.1 selecting channels individually (univariate ap- 



proach) is not helpful or may lead to suboptimal features. The data may be non-stationary overall despite the 
fact that each dimension seems stationary. Moreover, a single non-stationary source may be expressed across 
a large number of channels. It is therefore more sensible to estimate a linear projection of the data which 
contains as much information relating to change-points as possible. In this chapter, we demonstrate that 
finding the projection to the most non-stationary directions using a variant of Stationary Subspace Analysis 
significantly increases the performance of change-point detection algorithms. 

In the remainder of this section, we first review the SSA algorithm and show how to extend it towards 
finding the most non-stationary directions. Then we show that this approach corresponds to finding the 
projection that is most likely to be non-stationary in terms of a statistical hypothesis test. 



2.2.1 Stationary Subspace Analysis 

The following section uses material adapted from the supplementary material from the original SSA publi- 
cation [54] and, of course, the paper upon which this chapter is based (see [10J ) . 

Stationary Subspace Analysis factorizes a multivariate time series x(t) £ WL D into stationary and non- 
stationary sources according to the linear mixing model, 



x{t) = As{t) 



A s A n 



s*(t) 



(2.1) 



where s s (t) are the d s stationary sources, s n (t) are the d n [d n +d s = D) non-stationary sources, and A is an 
unknown time-constant invertible mixing matrix. The spaces spanned by the columns of the mixing matrix 
A s and A n are called 5- and n-space respectively. Note that in contrast to Independent Component Analysis 
(ICA) (22], there is no independence assumption on the sources s(t). 



The aim of SSA is to invert the mixing model (Equation 2.1 ) given only samples from the mixed sources 



x(t), i.e. we want to estimate the demixing matrix B that separates the stationary from the non-stationary 
sources. Applying B to the time series x(t) yields the estimated stationary and non-stationary sources s s (t) 
and s n (t) respectively, 



3>(t) 



Bx(t) 



B* 
B" 



x{t) 



B s A* B 5 A n 
B n A s B n A n 



s°(t) 



(2.2) 



The submatrices B s £ E, dflXl3 and B n e R,( d ») x£) of the estimated demixing matrix B project to the estimated 
stationary and non-stationary sources and are called s-projection and n-projection respectively. The estimated 
mixing matrix A is the inverse of the estimated demixing matrix, A = B~ l . 



The inverse of the SSA model (Equation 2.1l is not unique: given one demixing matrix B, any linear 



transformation within the two groups of estimated sources leads to another valid separation, because it leaves 
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the stationary resp. non-stationary nature of the sources unchanged. In addition, the separation into s- and 
n-sources itself is not unique: adding stationary components to a non-stationary source leaves the source 
non-stationary, whereas the converse is not true. That is, the n-projection can only be identified up to 
arbitrary contributions from the stationary sources. Hence one cannot recover the true n-sources, but only 
the true s-sources (up to linear transformations). Conversely, we can identify the true n-space (because the 
s-projection is orthogonal to it) but not the true s-space. However, in order to extract features for change- 
point detection, our aim is not to recover the true non-stationary sources, per se (since as we will see below, 
defining the non-stationary sources is problematic), but instead the most non-stationary ones. 

An SSA algorithm depends on a definition of stationarity, which the s-projection aims to satisfy. In the 
SSA algorithms [54j [22] , a time series X t is considered stationary if its mean and covariance is constant over 
time, i.e.: 

E[X tl ] = E[X t2 ] 
E[X tl X^]=E[X t2 X t "[], 

for all pairs of time points t i , ti £ Nq. This is a variant of weak stationarity |44| whereby time structure is not 
taken into account. Following this concept of stationarity, the SSA algorithm [53] finds the s-projection B 5 
that minimizes the difference between the first two moments of the estimated s-sources s B (t) across epochs of 
the time series. (Non-overlapping epochs of the time series are used since one cannot estimate the mean and 
covariance at a single time point.) Thus the samples from x(t) are divided into n non-overlapping epochs 
defined by the index sets Ty , ■ ■ ■ , T n C N and the epoch mean and covariance matrices are estimated as: 

A* = W\ Y x ^> and ^ = \ T \~_ 1 (*(*) _ At) (*(*) - A*) T , 

respectively for all epochs 1 < i < n. Given an s-projection, the epoch mean and covariance matrix of the 
estimated s-sources in the i-th epoch are: 

A" = B s ^ and S] = B S %{B*) T . 

The difference in the mean and covariance matrix between two epochs is measured using the Kullback-Leibler 
divergence between Gaussians. (Due to the fact that we measure only the covariance and mean, the maximum 
entropy principle tells us that the most prudent model is the Gaussian.) The objective function is the sum 
of the information theoretic difference between each epoch and the average epoch. Since the s-sources can 
only be determined up to an arbitrary linear transformation and since a global translation of the data does 
not change the difference between epoch distributions, without loss of generality the data is centered and 
whitened Q This implies that one may assume that the average epoch's mean and covariance matrix are: 

N N 

]vX> = and nY^ = L ( 2 - 3 ) 

i=l i=l 

Moreover, the search for the true s-projection may be restricted to the set of matrices with orthonormal rows, 

1 A whitening transformation is a basis transformation W that sets the sample covariance matrix to the identity. It can be 

- _ i 

obtained from the sample covariance matrix S as W = S 2 . 
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i.e. B*(B 5 ) T = I. Thus the optimization problem becomes: 

N 

B s = argmin V D KL [AA(/i*, Sf ) JV(0, /) 

TV 

argmin V (- log det Ef + (Af ) T A?) , (2.4) 

BB T —I „■_, v y 



which can be solved efficiently by using multiplicative updates with orthogonal matrices parameterized as 
matrix exponentials of antisymmetric matrices |M1 H5] [^J 

2.2.2 Finding the Most Non-Stationary Sources 

In order to extract useful features for change-point detection, we would like to find the projection to the most 
non-stationary sources. However, the SSA algorithms [54"l |2"2"] merely estimate the projection to the most 
stationary sources and choose the projection to the non-stationary sources to be orthogonal to the found 
s-projection, which means that all stationary contributions are projected out from the estimated n-sources. 
In the case in which the covariance between the sources orthogonal to the s-sources is constant over time, this 
implies that no information relating to non-stationarity is thereby lost by following this protocol to obtain 
sources containing non-stationarity. This constancy, however, may not always hold: non-stationarity may well 
reside in changing covariance between s- and n-sources. So, intuitively, we see that in order to find directions 
which do not lose information relating to the non-stationarity contained in the data, we need to propose a 
different method than simply taking the orthogonal complement of the stationary projection. The problem 
then remains whether we may frame a sensible way of defining the non-stationary sources, independently 
of the definition of the stationary sources, since any non-stationary source remains non-stationary when a 
stationary source is imposed onto that source: it is also not at all clear if there is a sensible way to define 
non-stationarity of a data set which non-circularly guarantees that the superposition of stationary noise onto 
a non-stationary direction yields a less non-stationary direction and thus allows us to recover the orthogonal 
case mentioned above; neither is it clear that this should be the case from an information theoretic point 
of view^] Therefore, we simply take, as our definition of the non-stationary sources, those sources which 
maximize the SSA loss function. This definition makes, trivially, the non-stationary sources unique; thus, 
as our working definition for non-stationarity, we are taking the measure which is optimized for SSA (see 



Equation 2.4 1. There are various independent justifications for this measure, which we will not explore too 
deeply here: these include the fact that the minimum of the SSA loss function is a consistent estimator 
for the stationary projection, that the loss is information theoretic in the sense that it is independent of 
parameterizations of the probability distribution and underlying vector space and yields positive results for 
the task at hand (a posteriori justification). Thus the rationale behind the following approach is: 

1. We define a non-stationarity measure using the SSA loss function due to 
its intuitive plausibility. 

2. We thus define the non-stationary sources as those maximizing the 
non-stationarity. 



2 An efficient implementation of SSA may be downloaded free of charge at http : //www. stationary- subspace- analysis . org/ 
toolbox] 

'The project of describing non-stationarity canonically in terms of information theory, is, however, problematic. Sec Chapter^] 
for details. 
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Thus, taking the SSA loss as our definition of non-stationarity, we aim to find the most non-stationary 
sources: this means optimizing the n-projection instead of the s-projection. Before we turn to the optimization 
problem, let us first of all analyze the situation more formally in order to develop some intuition for the 
difference between maximizing non-stationarity and taking the orthogonal complement of the stationary 
projection. 




Figure 2.2: The left panel shows two epoch covariance matrices Si and S2 where the non-stationarity is 
confined to changes in the variance along one direction, hence the most non-stationary projection B n is 
orthogonal to the true stationary projection B s . This is not the case in the situation depicted in the right 
panel: here, the covariance of the two dimensions changes between Si and S 2 , so that we can find a non- 
stationary projection that is more non-stationary than the orthogonal complement of the true s-projection. 



We consider first a simple example comprising one stationary and one non-stationary source with corre- 
sponding normalized basis vectors = 1 and \\A n \\ = 1 respectively, and we let <j> be the angle between 
the two spaces, i.e. cos0 = A sT A n . We will consider an arbitrary pair of epochs, 71 and T2, and find the 
projection B n which maximizes the difference in mean A p and variance A CT between 71 and T 2 . 

Let X\ and X2 be bivariate random variables modeling the distribution of the data in the two epochs 
respectively. According to the linear mixing model (Equation |2.1| , we can write X\ and X2 in terms of the 
underlying sources: 

X x = A S X S + A n X ni 
X 2 = A*X S + A n X n2 

where the univariate random variable X s represents the stationary source and the two univariate random 
variables X 7ll and X 7l2 model the non-stationary sources, in the epochs 7l and T2 respectively. Without loss 
of generality, we will assume that the true s-projection B s — (A* 1 ) 1 - is normalized, ||-B S || = 1. In order to 
determine the relationship between the true s-projection and the most non-stationary projection, we write 
B n in terms of B 5 and A n : 

B n = aB s + /3A nT , (2.5) 

with coefficients a, f3 E R such that ||B n || = 1. In the next step, we will observe which n-projection maximizes 
the difference in mean A M and covariance A CT between the two epochs 71 and 7~2. Let us first consider the 
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difference in the mean of the estimated n-sources: 

A M = E^Xi] - E[B n X 2 ] = B n A'\nx ni ] - E[X n2 }). 

This is maximal for B n A a = 1, i.e. when B n is orthogonal to B s . Thus, with respect to the difference in the 
mean, choosing the n-projection B n to be orthogonal to the 5-projection is always optimal, irrespective of 
the type of distribution change between epochs. 

Let us now consider the difference in variance A a of the estimated n-sources between epochs. This is 
given by: 

A a = Var^"^] - Vai[B n X 2 ] = /3 2 (Var[A ni ] - Var[X n2 ]) 



■2 



a cos i 



>s (^+») +/3cos0] (Cov[X s ,X ni ] - Cov[X t ,X na ]) . 
Clearly, when there is no change in the covariance of the s- and the n-sources between the two epochs, 



2.2 



for an 



i.e. A ersn = 0, the difference A CT is maximized for B n = (B*)- 1 . See the left panel of Figure 
example. However, when the covariance between s- and n-sources does vary, i.e. |A . sn | > 0, the projection 
{B 5 ) 1 - is no longer the most non-stationary. To see this, consider the derivative of A a with respect to the a 
at a = 0: 



dAJda\ a=0 = 2 cos + |) A asn . 



Since this derivate does not vanish, a = (see Equation 2.5 1 is not an extremum when |A 5n | > 0, which 
means that the most non-stationary projection is not orthogonal to the true s-projection. This is the case in 
the right panel of Figure |2.2| 

Thus here we have seen an example where clearly maximizing non-stationarity is equivalent to maximizing 
variance difference and we have seen that this is not equivalent to taking the orthogonal complement of the 
stationary projection. Of course, maximizing variance differences is, in general, not an appropriate definition 
for non-stationarity, which is where the SSA loss comes into play as a more general measure]^] 

Thus, in order to find the projection to the most non-stationary sources, we need to maximize the non- 
stationarity of the estimated n-sources. To that end, we propose maximizing the SSA objective function 
(Equation |2.4[ ) for the n-projection: 

N 

B" = argmax V (- log det E? + (tf) T ft) , (2.6) 

BBT=I i=1 V > 

where tf = S n S 4 (B") T and ^ = B n ji l for all epochs 1 < % < N. 



2.2.3 Relationship to Statistical Testing 

In this section we show that maximizing the SSA objective function to find the most non-stationary sources 
can be understood from a statistical testing point-of-view, in that it also maximizes a test statistic which 
minimizes the p-value for rejecting the null hypothesis that the estimated directions are stationary. In doing 

4 There are other, arguably canonical, characterizations of non-stationarity, for example, as the empirical entropy over the 
parameter space of the non-stationary process; this is a topic of current research. 
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so, we provide an alternative rationale for the loss function we maximize. In addition, the interpretation of 
the loss function in terms of testing will allow us to detect the number of actually non-stationary directions 
in dataset. 

More precisely, we maximize a test statistic which thereby minimizes a p-value for a statistical hypothesis 
test that compares two models for the data: the null hypothesis H that each epoch follows a standard normal 
distribution vs. the alternative hypothesis Ha that each epoch is Gaussian distributed with individual mean 
and covariance matrix. Let X± , . . . , Xjq be random variables modeling the distribution of the data in the N 
epochs. Formally, the hypothesis can be written as follows: 

H :X u ...,X N ~M(0,I) 

H A :Xi~ Affjn, Si), ...,X N ~ A%JV, Ejy) 

In other words, the statistical test tells us whether we should reject the simple model Hg in favor of the 
more complex model Ha- This decision is based on the value of the test statistics, whose distribution is 
known under the null hypothesis Hq. Since Hq is a special case of Ha and since the parameter estimates are 
obtained by Maximum Likelihood, we can use the likelihood ratio test statistic A |55| . which is the ratio of 
the likelihood of the data under Hq and Ha, where the parameters are their maximum likelihood estimates. 

Let X C R dn be the data set which is divided into N epochs 71, ... ,7~ n and let /t", ... , /tjy and E", . . . , EJv 
be the maximum likelihood estimates of the mean and covariance matrices of the estimated n-sources respec- 
tively. Let pw(x;/K, E) be the probability density function of the multivariate Gaussian distribution. The 
likelihood ratio test statistic is given by: 



A(X) = -2 log "*e*™^'»>y , (2.7) 



which is approximately \ 2 distributed with ^Nd n (d n + 3) degrees of freedom [55]. Using the facts that we 
have set the average epoch's mean and covariance matrix to zero and the identity matrix respectively, i.e.: 



N , N 



{?£tf = ° and i£s« = J, (2. 



N N 

i=l i=l 

Letting, in addition, M= no. of data points in the entire dataset, gives the following difference of loga- 
rithm^] 

N M 

A(X) = log(JJ : J (2 7 r)- d/2 |E i r5e-5(^-A i ) T sr 1 (^-Ai)) _ \ g(Y[(2Tr)- d/2 e-^ x i) (2.9) 

The simplicity of the right hand term is because we are testing the hypothesis of whether the data is 
generated from a normal distribution with constant covariance and mean which we can assume w.l.o.g. to 
be resp. white and 0. 

This gives us: 

N M 

]T J2 (l°g(2^ /2 ) - =Iog(|Ei|) - ^ ~ Aif^fe - Ai)) - (^(log(2vr-' i / 2 ) - -xjxj)) (2.10) 



i=l j£Ti ] = 1 



5 Throughout we require the following identity: z T Az = tr(zz T A). 
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The 7r terms cancel and we can multiply out the rest to get: 



E -f iog(|fi,i) - \ E E - />[>:, - xjfir^ + aJVa*) + * E ^ (2.11) 

i=i i=i jeTi i=i 

Take the second term without the factor of —1/2 in front and neglecting the outer sum: the inner sum 
distributes over the multiplication with the inverses to give: 

(£ x J^ x o) - JV<(A?VAi + - Af^Ai) (2.12) 



Which is: 



Which is: 



Which gives: 



(^xjt^xA-Niigt^fo (2.13) 



(E tr(x 3 xj£^) - 7V,tr(AiAf Sr 1 ) (2.14) 



tr((E zjarj) " N^ifXlt- 1 ) = tr^S.Sr 1 ) = tr(iV 4 7) = dN t (2.15) 



Now take the final term from equation |2 . 1 1 1 with the factor in front: 

M N N 

E x J x i = EI x J x i = J2Y1 ( x J x i ~ A^Ai + Pj Ai) (2-16) 

] = 1 i=l jGTj i=l jGTj 

Then we use the same trick as before but using the identity / as A to get: 

M N 

J2 x J x j = Z)Wtr(Si) +Nifif Ai)) (2.17) 

3=1 »=1 



So the term in equation |2 . 1 1 1 becomes : 



N , JV „ N 



E-flogdSd) - ^ E d ^ + ^ E(^tr(E,) + Nifrlfc,)) (2.18) 

i— 1 i— 1 z— 1 

Which in all cases simplifies to: 

N N A 1 W 

E -^lo g (|^|) - + - E^ tr (^) + ^''Ai Ai)) (2-19) 

i=l i=l 

So the test statistic simplifies to: 

N 

A(X) = -4iV + E^(-logdetSr + ||Ari| 2 +tr(Sr)) , (2.20) 
where iVj = |7i| is the number of data points in the i-th epoch. If every epoch contains the same number 
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of data points (iVj = ■ ■ • = Njy). then maximizing the SSA objective function (Equation 2.6 1 is equivalent 
to maximizing the test statistic (Equation 2.201 and hence minimizing the p-value for rejecting the simple 
(stationary) model for the data. 



As we will see in the application to Fault Monitoring (Section 2.4), the p- value of this test furnishes a 
useful indicator for the number of informative directions for Change Point Detection. More specifically to 
obtain an upper bound on the optimal number of directions for Change Point Detection, we use SSA to find 
the stationary sources, increasing the number of stationary sources until the test returns that the projection is 
significantly non-stationary. These sources may safely be removed without loss of informativeness for Change 
Point Detection. Removing more directions may sacrifice information for Change Point Detection; on the 
other hand, depending on the particular data set, removing additional directions may lead to increases in 
performance as a result of the reduced dimensionality. Notice here, that the procedure for obtaining the 
upper bound uses the standard SSA algorithm, whereas in the final preprocessing step for Change Point 
Detection we optimize for non-stationarity. 



To demonstrate the feasibility of using the test statistic to select the parameter d s we display, in Figure 2.3 
p-values obtained using SSA for a fixed value for simulated data's dimensionality D — 10, the number of 
stationary sources ranging from d\ = 1, . . . , 9 and the chosen parameter ranging from 1, . . . , 9. The confidence 
level p = 0.01 for rejection of the null hypothesis Hq = The projected data is stationary returns the correct d s , 
on average, in all cases. For each simulation, a dataset was synthesized (according to the details described in 



Section 2.3.1 ) of length 20,000 with 200 epochs. Then for each possible parameter setting for d s , a stationary 



projection P s was computed. Finally the value of the test statistic together with the p— value were computed 
on the estimated stationary sources. The dataset is described in Section |2.3| 



p-values for stationarity test of estimated s-sources 
1 
2 




Figure 2.3: Average p-values obtained over 100 realizations of the dataset for each setting of the actual d s . 
The number of dimensions is D = 10 and the y-axis displays the actual number for d s . The a;— axis displays 
the value for the parameter d s used to compute the stationary projection using SSA. The red box shows the 
decision made at the p = 0.01 confidence level. The red box displays the choice made using this decision rule 
for choosing the parameter d s which occurs most often. The results show that the method picks the correct 
parameter on averages, over simulations and thus represents a feasible procedure for choosing the number of 
non-stationary sources, d n . 
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Table 2.1: Overview of simulations performed and corresponding figures reporting the results. A tick denotes 
that the corresponding parameter was varied in the experiment. A cross denotes that the corresponding 
parameter is kept fixed, ("pa." denotes panel within the respective figures.) 

2.3 Simulations 

In this section we demonstrate the ability of SSA to enhance the segmentation performance of three change- 
point detection algorithms on a synthetic data setup. The algorithms arc single linkage clustering with 
divergence (SLCD) [T5] which uses the mean and covariance as test statistics, CUSUM [31], which uses a 
sequence of hypothesis tests and the Kohlmorgen/Lemm [30], using a kernel density measure and a hidden 
Markov model. For each segmentation algorithm we compare the performance of the baseline case in which 
the dataset is segmented without preprocessing, the case in which the data is preprocessed by projecting to 
a random subspace and the case in which the dataset is preprocessed using SSA. We compare performance 
with respect to the following schemes of parameter variation: 

1. The dimensionality D of the time series is fixed and d n , the number of non-stationary sources is varied. 

2. The number d n of non-stationary sources is fixed and d s , the number of the stationary sources is varied. 

3. D. d n and d s are fixed and the power q between the changes in the non-stationary sources is varied. 

For two of the Change-Point Detection algorithms which we test, SLCD and Kohlmorgen/Lemm, all three 
parameter variation schemes are tested. For CUSUM the second scheme does not apply as the method is a 
univariate method. 

For each setup and for each realization of the dataset we perform segmentation on the raw dataset, the 
estimated non-stationary sources after SSA preprocessing for that dataset and on a d n dimensional random 
projection of the dataset. The random projection acts as a comparison measure for the accuracy of the 
SSA-estimated non-stationary sources for segmentation purposes. 

2.3.1 Synthetic Data Generation 

The synthetic data which we use to evaluate the performance of Change Point Detection methods is generated 
as a linear mixture of stationary and non-stationary sources. The data is further generated epoch-wise: each 
epoch has fixed length and each dataset consists of a concatenation of epochs. The d stationary sources are 
distributed Normally on each epoch according to Af(0,ld s )- The other d n (non-stationary) source signals 
s n (t) are distributed according to the active model k of this epoch; this active model is one of five Gaussian 
distributions Qk = A/"(0, the covariance is a diagonal matrix whose eigenvalues are chosen at random 
from five log-spaced values between a\ = 1/q and a\ = q; thus five covariances, corresponding to the Qk of 
the Markov chain are then chosen in this way. The transition between models over consecutive epochs follows 
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a Markov model with transition probabilities: 




0.9 



0.025 i ^ j. 



i = 



3 



(2.21) 



Here, the indices i and j correspond to the ith and jth epoch, respectively of the dataset, so that P^j describes 
the probability of transition between the ith and jth epochs. 

In our experiments, we vary the parameters D, the total number of sources, d n (or equivalently, d s ), the 
number of non-stationary sources and q, the power change in the non-stationary sources. 

2.3.2 Performance Measure 

In our experiments we evaluate the algorithms based on an estimation of the area under the ROC curves 
(AUC) across realizations of the dataset. The true positive rate (TPR) and false positive rate (FPR) are 
defined with respect to the fixed epochs which constitute the synthetic dataset; a change-point may only 
occur between two such epochs of fixed length. Each of the change-point algorithms, which we test, reports 
changes with respect to the same division into epochs as per the synthetic dataset: thus the TPR and FPR 
are well defined. 

We use the AUC because it provides information relating to a range of TPR and FPR. In signal detection 
the trade off achieved between TPR and FPR depends on operational constraints: cancer diagnosis procedures 
must achieve a high TPR perhaps at the cost of a higher than desirable FPR. Network intrusion detection, 
for instance, may need to compromise the TPR given the computational demands set by too high an FPR. 
In order to assess detection performance across all such requirements the AUC provides the most informative 
measure: one integrates over all possible trade offs. More specifically, each algorithm is accompanied by a 
parameter r which controls the trade off between TPR and FPR. For SLCD this is the number of clusters, 
for CUSUM this is the threshold set on the log likelihood ratio and for the Kohlmorgen/Lemm this is the 
parameter controlling how readily a new state is assigned to the model. In each case we vary r to obtain 



2.3.3 Single Linkage Clustering with Symmetrized Divergence Measure (SLCD) 

Single Linkage Clustering with a Symmetrized Distance Measure is a simple algorithm for Change Point 
Detection which has, however, the advantage of efficiency and of segmentation based on a parameter inde- 
pendent distance matrix (thus detection may be repeated for differing trade offs between TPR and FPR 
without reevaluating the distance measure). In particular, segmentation based on Single Linkage Clustering 
|18| computes a distance measure based on the covariance and mean over time windows to estimate the 
occurrence of change-points: the algorithm consists of the following three steps: 

1. The time series is divided into 200 epochs for which we estimate the epoch-mean and epoch-covariance 



AUCs. 



matrices {(/ij,Sj)} 



200 

%—v 



2 



The dissimilarity matrix D s j>200x200 between the epochs is computed as the symmetrized Kullback 
Leibler divergence Z?kl between the estimated distributions (up to the first two moments), 
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where S) is the Gaussian distribution. 

3. Based on the dissimilarity matrix D, Single Linkage Clustering [18_ (with number of clusters set to k = 5) 
returns an assignment of epochs to clusters such that a change-point occurs when two neighbouring 
epochs do not belong to the same cluster. 



Stationary (1-4) and non-stationary (5-6) sources + true changepoints (red) 
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Figure 2.4: An Illustration of a case in which SSA significantly improves single linkage clustering with 
divergence: d s = 4 (no. of stat. sources), d n = 2 (no. non-stat.), q — 2.3 (power change in non-stat. 
sources) . The top panel displays the true decomposition into stationary and non-stationary sources with the 
true change-points marked. The middle panel displays the change-points which SLCD finds on the entire 
data set (sources mixed): clearly some change-points are left undetected. The bottom panel displays the 
change-points found by SLCD on the estimated non-stationary sources. 



2.3.3.1 Results 

The results of the simulations for varying numbers of non-stationary sources in a dataset of 30 channels are 



shown in Figure 2.5 in the first panel. When the degree to which the changes are visible is lower (i.e. there are 
fewer non-stationary directions in the data setup), SSA preprocessing significantly outperforms the baseline 
method, even for a small number of irrelevant stationary sources. 

The results of the simulations for a varying number of stationary dimensions with 2 non-stationary 



dimensions are displayed in Figure 2.5 in the second panel. For small d s the performance of the baseline 
and SSA preprocessing are similar: SSA's performance is more robust with respect to the addition of higher 
numbers of stationary sources, i.e. noise directions. The segmentations produced using SSA preprocessing 
continue to carry information relating to change-points for d s — 30, whereas, for d s > 12, the baseline's AUC 
approaches 0.5, which corresponds to the accuracy of randomly chosen segmentations. 

The results of the simulations for varying power q in the non-stationary sources with D = 20, d s = 16 



(no. of stat. sources) and d n = 4 are displayed in Figure 2.5 in the third panel. Both the performance of 
the baseline and of the SSA preprocessing improves with increasing power change q. This effect is evident 
for lower q for the SSA preprocessing than for the baseline. 
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An illustration of a case in which SSA preprocessing significantly outperforms the baseline is displayed in 
Figure |2.4| The estimated non-stationary sources exhibit a far clearer illustration of the change-points than 
the full dataset: the corresponding segmentation performances reflect this fact. 




Figure 2.5: Results of the simulations for Single Linkage Clustering with Symmetrized Divergence (SLCD). 
The left panel displays the results for a fixed dimensionality of the time series. D = 30 and varying d n , 
the number of stationary sources with q = 1.8. The middle panel displays the results for a fixed number of 
non-stationary sources, d n = 2 and varying d s , the number of stationary sources and with the power change 
q = 1.8. The right panel displays the results for fixed D — 20, d s — 16 and d n — 4 and for varying q, the 
power change in the non-stationary sources. Each displays the results in terms of the area under the ROC 
curve computed as per Section [2.3. 2 1 The error bars extend from the 25th to the 75th percentiles. 



2.3.4 Weighted CUSUM for changes in variance 

In statistical quality control, CUSUM (or cumulative sum control chart) is a sequential analysis technique 
developed in 1954 [JT]. CUSUM is one of the most widely used and oldest methods for Change Point Detec- 
tion; the algorithm is an online method for Change Point Detection based on a series of log-likelihood ratio 
tests. Thus CUSUM algorithm detects a change in parameter 9 of a process pg (y) [U| and is asymtotically 
optimal when the pre-change and post-change parameters are known [3]. For the case in which the target 
value of the changing parameter is unknown, the weighted CUSUM algorithm is defined as a direct extension 
of CUSUM [I], by integrating over a parameter interval. The following statistics constitutes likelihood 
ratios between the currently estimated parameter of the non-stationary process and differing target values 
(values to which the parameter may change), integrated over a measure F: 

m=( r pe ^^ yk \ dF{e 1 ) 

\J-ooPe {yj,-,yk) 

Here yj, ...,yk denote the timepoints lying inside a sliding window of length k whereby y^ indicates the 
latest time point received. The stopping time is then given as follows: 

t a = min{fc : ma,x{j < k : ln(A^) > h}} (2.23) 

The function F serves as a weighting function for possible target values of the changed parameter. In 
principle the algorithm can thus be applied to multi-dimensional data. However, as per H], the extension of 
the CUSUM algorithm to higher dimensions is non-trivial, not just because integrating over possible values 
of the covariance is computationally expensive but also because various parameterizations can lead to the 
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same likelihood function. Given this, we test the effectiveness of the algorithm in computing one-dimensional 
segmentations. In particular we compare the segmentation performed on the one dimensional projection 
chosen by SSA with the best segmentation of all individual dimensions with respect to hit-rate on each trial. 
In accordance with jj] we choose F to comprise a fixed uniform interval containing all possible values of the 
process's variance. We approximate the integral above as a sum over evenly spaced values on that interval. 
We approximate the stopping time by setting: 

t a a min{fc : \n{k k k _ w+1 ) > h] (2.24) 

The exact details of our implementation are as follows. Let T be the number of data points in the data set 
X. 

1. We set the window size W, the sensitivity constant h and the current time step as t c = W + 1 and 
9q = var({xi, xw}) and = {6>i, . . . , 9 r } — {c, c + b, c + 2b, . . . , d}. 

■ j b L~n=\ pa (vi,.".v*) 

3. If ln(A^) > h then a change-point is reported at time t c and t c is updated so that t c = t c + W and 
9o = vai({xt c -w+ii ■ ■ ■ We return to step 2. 

4. Otherwise if < h no change-point is reported and t c = t c + 1. We return to step 2. 
2.3.4.1 Results 

In Figure [2T6| in the left panel, the results for varying numbers of stationary sources are displayed. Weighted 
CUSUM with SSA preprocessing significantly outperforms the baseline for all values of D (dimensionality of 
the time series). Here we set d n = 1, the number of non-stationary sources, for all values of d s , the number 
of stationary sources. 

In Figure [276] in the right panel, the results for changes in the power change between ergodic sections q are 
displayed for D = 16, d s = 15 and d n — 1. SSA outperforms the baseline for all except very low values of q, the 
power level change, where all detection schemes fail. The simulations show that SSA represents a method for 
choosing a one dimensional subspace to render uni-dimensional segmentation methods applicable to higher 
dimensional datasets: the resulting segmentation method on the one dimensional derived non-stationary 
source will be simpler to parametrize and more efficient. If the true dimensionality of the non-stationary part 
is d n — 1 then no information loss should be observed. 

2.3.5 Kohlmorgen/Lemm Algorithm 

The Kohlmorgen/Lemm algorithm is a flexible non-parametric and multivariate method which may be applied 
in online and offline operation modes. Distinctive about the Kohlmorgen/Lemm algorithm is that a kernel 
density estimator, rather than a simple summary statistic, is used to estimate the occurence of change- 
points. In particular the algorithm is based on a standard Kernel Density Estimator with Gaussian kernels 
and estimation of the optimal segmentation based on a Hidden Markov Model j3U| ■ More specifically if we 
estimate the densities on two arbitrary epochs JSj , Ej of our dataset X with Gaussian kernels then we can 
define a distance measure d between epochs via the L2-Norm yielding: 
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Figure 2.6: Results of the simulations for CUSUM. The left panel displays the results for a fixed number of 
non-stationary sources, d n = 1, fixed power change q = 1.8 and varying d s , the number of stationary sources. 
The right panel displays the results for fixed D = 16 and d = 15 and for varying q, the power change in the 
non-stationary sources. Each displays the results in terms of the area under the ROC curve computed as per 
Section 2.3.2 The error bars extend from the 25th to the 75th percentiles. 



v ' ui,v=0 \ v 7 

9 / {Y w -Z v )\ \ ( (Z w -Z v )\ \\ 

Here, Y w corresponds to the wth point of epoch Ei and Z v to the vth. point of epoch Ej . The formula for 
the distance measure d(Ei, Ej) is derived analytically based on the expression for the kernels used for density 
estimation and simplifies the computations over calculating the densities explicitly (see the corresponding 
paper [30] for details) . The final segmentation is then based on the distance matrix generated between epochs 
calculated with respect to the above distance measure d. As per the weighted CUSUM, it is possible to define 
algorithms whose sensitivity to distributional changes in reporting change-points is related to the value of 
a parameter C: C controls the probability of transitions to new states in the fitting of the hidden markov 
model. However, in |29| it is shown that in the case when all change-points are known then one can also 
derive an algorithm which returns exactly that number of change-points: in simulations we evaluate the 
performance on the first variant over a full range of parameters to obtain an ROC curve. In addition we 
choose the parameter a according to the rule of thumb given in [3D], which sets a proportional to the mean 
distance of each data point to its D nearest neighbours, where D is the dimensionality of the data: this is 
evaluated on a sample set. The exact implementation we test is based on the papers [25] and [30j. The 
details are as follows: 

1. The time series is divided into epochs 

2. A distance matrix is computed between epochs using kernel density estimation and the L2-norm as 
described above. 

3. The estimated density on each epoch corresponds to a state of the Markov Model. So a state sequence 
is a sequence of estimated densities. 

4. Finally, based on the estimated states and distance matrix, a hidden Markov model is fitted to the data 
and a change-point reported whenever consecutive epochs have been fitted with differing states. 



(2.25) 
(2.26) 
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2.3.5.1 Results 

SSA preprocessing improves the segmentation obtained using the Kohlmorgen/Lemm algorithm for all three 
schemes of parameter variation of the dataset. In particular: the area under the ROC (AUC) for varying d s 
and fixed D are displayed in Figure 2.7 in the first panel, with D — 30. The area under the ROC (AUC) for 
varying d s and fixed d n are displayed in Figure |2~7| in the second panel, with d n = 2. The area under the ROC 



(AUC) for varying power change in the non-stationary sources p and fixed D and d s are displayed in Figure 2.7 



in the third panel, with p ranging between 1.1 and 4.0 at increments of 0.1. Of additional interest is that for 
varying d n and fixed D the performance of segmentation with SSA preprocessing is superior for higher values 
of d s : this implies that the improvement of Change Point Detection of the Kohlmorgen/Lemm algorithm 
due to the reduction in dimensionality to the informative estimated n-sources outweighs the difficulty of the 
problem of estimating the n-sources in the presence of a large number of noise dimensions. 




Figure 2.7: Results of the simulations for the Kohlmorgen/Lemm. The left panel displays the results for a 
fixed dimensionality of the time series, D — 30, varying d n , the number of stationary sources and fixed power 
change q = 1.8. The middle panel displays the results for a fixed number of non-stationary sources, d n = 2, 
varying d s , the number of stationary sources and fixed power change p — 1.8. The right panel displays the 
results for fixed D = 20, d s = 16, d n = 4 and for varying g, the power change in the non-stationary sources. 
Each displays the results in terms of the area under the ROC curve computed as per Section [2.3.2| The error 
bars extend from the 25th to the 75th percentiles. 



2.4 Application to Fault Monitoring 

In this section we apply our feature extraction technique to Fault Monitoring. The dataset consists of 
multichannel measurements of machine vibration. The machine under investigation is a pump, driven by an 
electromotor. The incoming shaft is reduced in speed by two delaying gear-combinations (a gear-combination 
is a combination of driving and a driven gear). Measurements are repeated for two identical machines, where 
the first shows a progressed pitting in both gears, and the second machine is virtually fault free. The rotating 
speed of the driving shaft is measured with a tachometer]^] 

The pump data set is semi-synthetic insofar as we juxtapose non-temporally consecutive sections of data 
between the two pump conditions. Sections of data from the first and second machine are spliced randomly 
(with respect to the time axis) together to yield a dataset with 10,000 time points in seven channels. An 
illustration of the dataset is displayed in Figure |2.8| 

6 The dataset can be downloaded free of charge at http://www.ph.tn.tudelft.nl/~ypma/mechanical.html 
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Figure 2.8: Pump Dataset: the machine under investigation is a pump, driven by an electromotor. The 
measurements made are of machine vibration at seven sensors. The data alternates between two conditions: 
normal functionality and pitting in both gears. 

2.4.1 Setup 

We preprocessed with SSA using a division of the dataset into 30 equally sized epochs and for d s , the number of 
stationary sources ranging between 1 and 6, where D — 7 is the dimensionality of the dataset: subsequently 
we ran the Kohlmorgen/Lemm algorithm on both the preprocessed and raw data using a window size of 
W — 50 and a separation of 50 datapoints between non-overlapping epochs. 



2.4.2 Parameter Choice 

To select the parameter, d s (and thus d n = D — d s ), the number of stationary sources, we use the following 
scheme: the measure of stationarity over which we optimize for SSA is given by the loss function in equation 
(2). For each d s = dim(V s ) we compute the estimated projection to the stationary sources using SSA on 
the first half of the data available and computed this loss function on the estimated stationary sources on 
the second half and compared the result to the values of the loss function obtained on the dataset obtained 
by randomly permuting the time axis. This random permutation should produce, on average, a set of 
approximately stationary sources regardless of non-stationarity present in the estimated stationary sources 
for that d s . In addition a measure of the information relating to non-stationarity lost in choosing the number 
of stationary sources to be d s , we define the Baseline-Normalized Integral Stationary Error (BNISE) as 
follows: 

d , <d (T X i{L d ,{A ',X)) 



Here, L d >(A ,X) denotes the loss function given in equation 2.4 on the original dataset with stationary 



parameter r and L r (A , X') the same measure on a random permutation X' of the same dataset. That 



is using the notation of Equation 2.4 here on the right hand side, L^A -1 , X) — L(A^ 1 ) evaluated when 
specifying the free parameter to SSA d s = d. 

The motivation of the BNISE is that due to random fluctuations, corresponding to sample sizes, even a 
stationary dataset should be measured as slightly non-stationary by the SSA loss function. Thus we measure 
the deviation in loss between the dataset at hand and the expected loss of a stationary dataset estimated by 
using the same dataset shuffled on the time axis. This difference is then normalized by the standard deviation 
over losses estimated under such shuffles. 
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2.4.3 Results 

The results of this scheme and the segmentation are given in Figure |2.9| For d s = 6 we observe a clearly 
visible difference between the expected loss function value due to small sample sizes and the loss function 
value present in the estimated stationary sources. Similarly looking at the p-values, we observe that for 
d s = 1,2 we do not reject the hypothesis that the estimated s-sources are stationary whereas for higher 
values of d s we reject this hypothesis. This implies that d n > 5. To test the effectiveness of this scheme, 
segmentation is evaluated for SSA preprocessing at all possible values of d s . The AUC values obtained using 
the parameter choices d s = 1, . . . , 6 for SSA preprocessing as compared to the baseline case are displayed in 
Figure [2~9) An increase in performance with SSA preprocessing is robust, as measured by the AUC values, 
with respect to varying choices for the parameter d s as long as d s is not chosen < 2. Note that, although, 
for the dataset at hand, there exists information relating to change-points in the frequency spectrum taken 
over time, this information cannot be used to bring the baseline method onto par with preprocessing with 
SSA. We display the results in Figure |2.11| for comparison. Here, segmentation based on a 7-dimensional 
spectrogram evaluated on each individual channel of the dataset is computed. The best performance over 
channels for segmentation on each of these spectrograms is lower than the worst performance achieved on 
the entire dataset without using spectral information, with or without SSA. 
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Figure 2.9: Pump Dataset: schemes for selecting the parameter d s . Top left: the measure BNISE for 
increasing values of d n . Top right the value of the error function as compared to randomly generated data. 
Bottom left: the AUC performances for various values of d n . Bottom left: p-values on the estimated s-sources. 

2.5 Conclusion 

Unsupervised segmentation and identification of time series is a hard problem even in the univariate case 
and has received considerable attention in science and industry due to its broad applicability ranging from 
process control and finance to biomedical data analysis. In high dimensional segmentation problems, different 
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Figure 2.10: Pump Dataset: all segmentations are computed using Kohlmorgen/ Lcmm with the number 
of change-points TV specified . The baseline corresponds to segmentation without SSA preprocessing. The 
middle panel displays segmentation with SSA preprocessing. The bottom panel displays the real change- 
points superimposed over the raw dataset. 



Performance via spectral method 
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Figure 2.11: Pump Dataset: performance on spectograms computed on individual channels of the datatset. 
Each spectrogram is computed with a window length of 50 datapoints and overlap of 49 datapoints. 7 
frequency band windows are used to compute a timeseries of size 7 x 10, 000. 



subsystems of the multivariate time series may exhibit clearer and more informative signals for segmentation 
than others. We have shown that it may be beneficial to decompose the overall system into stationary and 
non-stationary parts by means of SSA and use the non-stationary subsystem to determine the segmentation. 

This generic approach yields excellent results in simulations and we expect that the proposed dimension- 
ality reduction will be useful on a wide range of datasets, because the task of discarding irrelevant stationary 
information is independent of the dataset-specific distribution within the informative non-stationary sub- 
space. Moreover, using SSA for preprocessing is a highly versatile because it can be combined with any 
subsequent segmentation method. 

Applications made along the same lines as in the present thesis are effective only when the non-stationary 
part of the data is visible in the mean and covariances. The method we have considered may be thus made 
applicable to general datasets whose changes consist in the spectrum or temporal domain of the data by 
computing the score function as a further preprocessing step (?]. 



Chapter 3 

Classification under Non-Stationarity 



Classification is a widely studied task in Machine Learning [12 , 32J. Less well studied, however, is classification 
under the assumption that training and test sets differ in distribution. The situation under which training 
and test sets differ arises naturally when both are drawn from a non-stationary time series. In particular, 
classification under non-stationarity poses a serious challenge for, for instance, the design of Brain Computer 
Interfaces [9]. In a wider setting, classification under the assumption of difference of distribution between 
test and training domains but without the assumption that both are drawn from a single time series has been 
studied, for example, by Moreover, online learning has been studied in the contribution of Murata et 
al. [37] • In the following, however, we assume that the training and test distributions are drawn from a single 
time series. In particular, we focus on the two way classification problem under non-stationarity whereby the 
class distributions on each epoch may be modeled as Gaussians with differing means. No algorithm has been 
proposed in the non-stationarity literature which aims to robustify classification against non-stationarity for 
this setting. This is exactly the aim of the present chapter. 

Some progress has been made, however, in the Brain Computer Interfacing literature in two way classi- 
fication under non-stationarity. In Brain Computer Interfacing (BCI), non-stationarity may be imposed by 
artifacts and learning related adaptation jSj. With a view to improving classification perfomance for BCI, 
the contributions: [9, 56, 53 present classification based, variance based and adaptation based methods for 
classification under non-stationarity. The method presented in [56 adapts the standard prior feature extrac- 
tion step for BCI, namely CSP [8J, to yield sCSP, i.e. stationary CSP. CSP functions on the assumption 
that the variance of two class distributions differ: a projection of the data is then sought which maximizes 
the difference between the variance of the two classes. A linear discriminant may subsequently be trained 
on the data consisting of variances within single trials [8J. The data thus extracted as variances on CSP 
features should be suitable for classification, under the stationarity assumption and the assumption that the 
classes differ in covariance. sCSP adapts this methodology by extracting features which are discriminative 
in variance and simultaneously stationary. This approach has been shown to alleviate, to some extent, the 
problem of non-stationarity in BCI. 

A drawback of the sCSP method is that it is specifically tailored to extracting discriminative features when 
the class distributions differ in covariance: thus sCSP is highly relevant to classification in Brain Computer 
Interfacing but does not necessarily extend to the general classification setting. In particular, sCSP is not 
applicable when the class distributions differ on each epoch but not necessarily in covariance. We aim to 
address this problem by designing a corresponding method below. Suppose, therefore, instead, that we only 



23 



SSA FOR CLASSIFICATION PROBLEMS 



24 



assume that we have set features which are separable but non-stationary; suppose, further, that the classes 
are discriminable with a given probability without further processing and may be modeled by their mean and 
covariance and the covariance on both classes is identical: that is to say that Linear Discriminant Analysis 
(LDA) [TS] is the correct ansatz. We will propose a method which improves classification under non-stationary 
for the LDA setting: we assume that the homoscedastic Gaussian assumption is valid within a transient time 
window but that the data may be non-stationary over a longer time frame; the algorithm will then find 
a classification direction which is obtained via optimization of a loss function whose minimization rewards 
separability and penalizes non-stationarity. We will test the resulting algorithm in Section [3. 4| and Section [3. 5 1 
on BCI data, since BCI represents a case-study where non-stationarity poses a serious problem: subsequently, 
the results will be rigorously analyzed and tested. Furthermore, we briefly consider the theoretical background 
to classification under non-stationarity in Section ??. 



3.1 SSA for classification problems 

The question now arises, before attempting to design new algorithms: may we use SSA in order to extract 
discriminative features which are stationary? The answer is: not in general. Firstly, because SSA does not use 
explicit class information. One may argue that we may choose epochs intelligently and thus use SSA to boost 
classification: equal numbers of samples from each class are used in each epoch [54 . Thus, in this fashion, 
SSA may be used to look for stationary directions. However this approach fails whenever non-stationarity is 
present which leaves the mixture distribution (with a balanced mixture) over classes stationary. 

One may avoid, in one respect, this deficiency brought on by an unmodified application of SSA by 
treating each class separately but deriving a loss function which stipulates that the distribution of each 
class, but not the joint distribution over classes, should be stationary at a zero of the loss function [35]; the 
resulting algorithm has been aptly named Group- Wise SSA. This approach, however, has the drawback that, 
although differences between classes are not treated as non-stationarities, there is no guarantee that the class 
information is preserved in the features which are thus derived. A second drawback of this Group- Wise SSA 
is that there may be non-stationarities projected out of the data which are not detrimental to classification. 



Figure 3.1 illustrates the difficulty with using SSA for classification and Figure 3.2 illustrates the dif- 
ficulty using LDA. In the first case, SSA may select the stationary direction but this direction contains 
no discriminative information. In the second case, LDA chooses the most discriminative direction on the 
training data which is, however, not-orthogonal to the most non-stationary direction. If we assume, that 
the non-stationarity which is present in the transition from training to testing, is poorly reflected in the 
training non-stationarity, then, LDA provides a highly suboptimal solution. Thus, neither SSA nor LDA 
may sufficiently cater for classification under non-stationarity in a unified manner. We aim to address these 
deficits in Section l3~3l 

The solution we propose, therefore, is that we trade off stationarity against discriminability. We describe 
this ansatz in detail in the following sections. 



3.2 LDA 

One standard approach for training a classifier is Linear Discriminant Analysis (LDA), first proposed in 
Fisher's landmark article of 1937 [15] ; LDA outputs a hyperplane with a decision bias given classes modeled 
by only their mean and covariances: given that each of the two classes are normally distributed with identical 
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Figure 3.1: The figure provides an illustration of the unsuitability of SSA for classification. The ellipses 
correspond to the class covariances in each class over two epochs. The y-direction corresponds to the direction 
chosen by SSA and is not discriminative. The x-direction is non-stationary but provides discriminative 
information. Thus the illustration demonstrates that using SSA may discard information which is important 
for classification. 

covariance parameters, LDA is the Bayes optimal classifier [21 . The decision criterion given by the hyperplane 
with normal vector w and bias b [T5] is given as follows: 



w = (Si + S 2 ) 1 (/xi - (Jq) 
b = w ritl±M 



(3.1) 
(3.2) 



Here, the fit refer to the class means and the Sj to the class covariances. The decision rule is then 
sign(w T x + b). According to the LDA ansatz, w is computed as the direction which maximizes the ratio 
between the within class covariances (fii — ^2) T (Mi — M2) an d between class covariance, Si + S 2 . The ratio 
maximized is thus: 



(w T (/ii - fa) 2 ) 



(3.3) 



u> T (£i + £ 2 )u> 

Optimizing a projection to maximize this ration will often fail for robust classification under non- 
stationarity, because the ratio does not take this non-stationarity into accound; however, this ratio will 
be incorporated into the error function described in the following section. 
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Figure 3.2: The figure provides an illustration demonstrating the unsuitability of LDA for classification under 
non-stationarity. The ellipses are schematics of covariances in a two dimensional space in each class (denoted 
by color) over two epochs. The diagonal direction is the most discriminative direction. The y-direction is 
stationary and discriminative and thus may provide a more robust classification direction than the LDA 
solution. Thus the illustration demonstrates that by using LDA may choose non-stationary directions, which 
provide poor generalization error on the test set. 

3.3 sLDA 



We will now propose a loss function which finds a hyperplane which, in the ideal scenario, performs classifica- 
tion (of course at error-rates above chance) but is robust against non-stationarities. Because we incorporate 
the LDA ratio into the loss function, we call the method sLDA, where the 's' stands for 'stationary'. Thus 
sLDA is derived as classification performed on the direction given by Equation |3T] and using the bias described 
below. 

A hyperplane, as a decision boundary, performs a one dimensional projection of the data (w) and makes 
a decision based on a threshold (b). It is thus only of interest that the data be as stationary as possible on 
this one dimensional projection. This implies that, although the data may be very non-stationary, we only 
need to find a single dimension which is both discriminative and stationary to succeed in our task. Using 
sLDA we aim to choose a direction for classification using a trade off loss function based on the ratio used 
by LDA but catering for non-stationarity. For the class decision we subsequently use the same threshold as 
for LDA. 



sLDA chooses a decision direction which is discriminative and stationary. 



SLDA 27 



The trade off function, which we maximize in order to choose a direction for classification w (a vector), 

is: 

L a (w) = aJ ^j? 1 '^ + (!-«)££ ®n S (w; Ej,&, (3.4) 

$ ns is the Kullback-Leibler divergence between the average epoch mean and the ith epoch gaussian 
estimated from the epoch data: this does not use the assumption of normalized means and covariances. The 
parameters Ej and fij denote the empirical mean and covariance of the j class evaluated on the training 
data. Thus: 



$ M (ifl;/i; ) S;,/i i ,E i ) = D KL (Ei\\E) 



(w t (l4 - /L;)) 2 1 fw T ^w\ I w T T, 



1 - log 



w 



2w T T,jW 2 \w T T,jW I \w T T,jW I 

Now we calculate the derivative of L a w.r.t. w. This is a simple matter as we are no longer dealing with 
rotation matrices as in the first chapter of this thesis. The derivative is calculated as follows: 



d_ / (^ffii-fe) 2 ) = 1 / (tt, r (fo-fr) 2 A 1/2 2(w T (il 1 -ti 2 mi--p2) 
dw y w T (£i + E 2 )w 2 ^ w T(g x + g 2 ) w y w T(g 1 + g 2 ) w 

1 / (w T (pi - £ 2 ) 2 ) \ _1/2 2(Ei + E 2 )w(u; T (^ 1 - /2 2 ) 2 ) 



2 V t« r (Ei + E 2 )«; / (w T (E 1 + E 2 H 2 



5 n 2(^( M 5 - fo))( M } - 2(E J -)^(^ T (^ - M,-) 2 ) 



dw w T E jW (w T V Jt 

lw T I^w2Z i J w -w T Z)w2TJw ( w t y)w\ 1 w T YJw2T,)w - w T Z l jW 2YJw 



x 



2 (w T E%) 2 ^ r E}wy 2 (w^SJu;) 2 

In practice we may attempt choose the regularization parameter a by cross validation. So the final 
procedure for computing an sLDA classifier given set features is as follows: 

3.3.1 Final Procedure For Training sLDA 

Here is the final procedure for training the sLDA classifier on set features: 

1. For each of k cross validation folds compute Wi as the maximizer, by gradient ascent Q of L ai (w) with 
||w|| = 1 for each choice of set parameter values a\, a*, a r . 

2. Set bi — wj (iii + /i 2 ) for each of these settings. 

3. Perform classification for each of these settings via the standard decision rule: x in class 1 if wf x+bi < 0, 
class 2 otherwise. 



J We use the function fmincon included in the matlab optimization toolbox to obtain this minimum: see:- http://www. 
mathworks . de/products/optimizat ion/index .html 
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4. Choose a to be the parameter setting performing best on average. 

5. Retrain w and b on the entire training set for this fixed optimal a as the maximizer of L a {w). 

Of course, for fixed a the cross-validation folds may be omitted and the direction w is simply selected 
as the direction maximizing the function L a (w). To select aj,...,a r , in practice, a small set of sensible 
parameter values for en are chosen by hand, from which the best is chosen by cross validation. 



3.4 Simulations 

The Simulations in the section aim to test the performance of sLDA on toy data and investigate the conditions 
under which sLDA should yield improvements in classification under non-stationarity. 



3.4.1 Overview of Simulations 



The simulations are divided into 3 subgroups. The first subgroup, described in Section |3.4.2.1[ is designed 
to check that any improvements observed using sLDA are not due to implicit regularization or robustness, 
for instance, due to the use of gradient based optimization. (Regularization as a result of early stopping in 
gradient based descent is a well documented phenomenon in Machine Learning; see, for example, |581 HO], 
for details.) Robustness refers to the ability of estimator to not produce drastically different results in the 
presence of outliers (see, for example, [2J for details). In order to test whether regularization may be provided 



as a result of gradient based optimization we compare closed form LDA (see Section 3.2 1 with the maximum 



argument w for the Fisher ratio: (mi jfe) ) obtained via gradient based optimization. We call this method 
for obtaining w, gradLDA. The various simulations within this regularization and robustness analysis section 
examine if and when exactly these phenomena may be expected through the comparison of the performance 



of gradLDA and LDA. In the first simulation (3.4.2.1 Simple), the data are generated as a mixture of sources 



each of which has approximately equal separation. The second simulation (3.4.2.1 Outliers) uses the same 



dataset as the first but including outliers. The third simulation (3.4.2.1 Hard) tests performance using 



datasets where very few directions in the data are discriminative, that is to say, the problem of finding a 



good discriminative direction is harder. The fourth simulation (3.4.2.1 Tapered Difficulty) tests the transition 



which occurs when moving from a dataset similar to |3.4.2.1| Hard to a dataset similar to |3.4.2.1| Simple. 



The second group of simulations (see Section 3.4.2.21 tests the performance of sLDA in choosing a specific 
discriminative yet stationary directions within the dataset when non-stationary and non-discriminative yet 



stationary directions are also present. The first simulation (3.4.2.2 Simple) uses only three epochs and 3 di 



mensions whereas the second simulation (3.4.2.2 Realistic) uses 7 epochs and 6 dimensions. The motivation 



for the section is simply to test the accuracy of sLDA over LDA in picking a stationary yet discriminative 
subspace without considering the more complicated issue of whether this implies higher generalization issue, 
which we discuss in Section |3.4.3.2| For both of these simulations there is a stationary yet discriminative 
direction, but also non-stationary and discriminative directions and stationary yet non-discriminative direc- 
tions. In each case, we test the performance of whether sLDA is more effective at picking the stationary and 
discriminative direction. 

The third and final group of simulations are designed to test what difference between test and training 
distributions is necessary in order to guarantee that the stationary and discriminative direction chosen by 
sLDA provides a classification direction which produces higher test error than the direction chosen by LDA. 



To this end, the first simulation (3.4.2.3 s-space small) investigates the difference between sLDA and LDA 
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when the number of stationary directions is small, whereas the second simulation (3.4.2.3 s-space large) 



investigates the difference between sLDA and LDA when the number of stationary directions is larger. 

3.4.2 Data Setups 

3.4.2.1 Sanity Checks for Regularization 

The first set of simulations test for the presence of regularization and robustness effects of gradient based 
LDA (gradLDA) versus standard, closed form, LDA. In particular, the following simulations are reported 
upon: 



Simple: The training data consists of 75 points drawn randomly from distinct Gaussians for each class 
and the test data, 150 points from each class. On each source the classes are generated in class 1 as 
i.i.d. samples from Af(0, 1) and in class 2 from M (0.7, 1). The dimensionality of the data is D = 6. We 
test the performance in terms of classification error between LDA and gradLDA, i.e. sLDA with a = 1. 
The results are displayed in the top left panel of Figure |3.3| 



Outliers: We next frame a data set which includes outliers added to the data used in the previous 
simulation (outliers are one type of non-stationarity) . Outliers are added sparsely at random time 
points uniformly to each class; the outliers are generated as samples from a Gaussian with mean 
20 times the size of the original data set and are added to the data on each class. As before, the 
training data consists of 75 points from each class and the test data, 150 points from each class. The 
dimensionality of the data set is D = 6. The results are displayed in the top right panel of Figure [3~3) 



• Hard: We further investigate the difference between LDA and gradLDA (see above in Section 3.4.1 1 as 
follows: we set the variance of each class to 1 on each source. 5 of the sources are chosen with class 
means differing by 0.2. For the remaining source we vary the degree of separation observed. In this 
simulation, outliers are not included. The results reported as classification errors are displayed in the 
bottom left panel of Figure |3.3| 



• Tapered Difficulty In the final simulation in this section, we retain the data setup from the previous 
simulation but we hold the separability (absolute value of the difference in means as Gaussians) of the 
most separable source constant (difference in means = 1.1), whilst increasing the separability of the 
second and third most separable sources from 0.1 to 1.1. The remaining source means are separated by 
0.2 as before. The results are displayed in the bottom right panel of Figure |3.3| In all cases the data 
are randomly mixed orthogonally. 

3.4.2.2 Subspace Based Comparison of sLDA and LDA 

In the second set of simulations we investigate simple non-stationary toy data setups and evaluate performance 
in terms of the angle between the normal vectors to the decision hyperplanes and stationary directions within 
the data sets. 

• Simple: The first data set has three dimensions: D = 3. In each case, all epoch distributions are 
Gaussian with variance 1 on each source. The first source is stationary and has no separation between 
classes. The second source has separation and is stationary; the classes are drawn from Gaussians 
with unit variance and means, and 0.7 respectively. The third dimension has separation and non- 
stationarity; the data on this 3rd dimension consists of 3 epochs; the outer 2 epochs are drawn from 
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normal distributions with unit variance and means 0, 1 respectively. The inner epoch is drawn from 
a Gaussian distribution which has in class one mean randomly chosen from the uniform distribution 
on [— a ns — 1,0] and in class 2 randomly chosen from the uniform distribution on [l,a ns ] where a ns 
corresponds to the non-stationarity level. The data are subsequently mixed orthogonally. 

The aim is to find the second dimension as a subspace and not the first or the third; the second has 
lower but stable separation as opposed to the 3rd dimension which has higher but unstable separation. 
The results are displayed in Figure |3.4| 

• Realistic: We next investigate the effect observed in the previous simulation in further detail: in 
particular, we increase the number of dimensions to D = 6. We investigate, again, for simplicity, non- 
stationarity in mean. We mix 6 sources orthogonally, which consist of samples from both classes. In 
each case, on each source and each epoch, 50 samples are drawn from each class. On each epoch, on 
source 1, we add 2 independent random numbers to the mean of both classes. The remaining sources 
are stationary. In addition, each source is generated with a separation between the means of the classes, 
as follows. For each epoch i, the classes on source 1, nj and n 2 are distributed according to Gaussians, 
where cij ~ -<V(0, n). 



nj(i) ~ Af( ai .l) (3.5) 
nj{t) ~ N{T + a h l) (3.6) 

(3.7) 

On source 2, viz. s sep , the classes are distributed according to stationary distributions. 



S 2 sep (t) ~ Af(b,l) (3.9) 
And, for j = 1, . . . , 4, the 3rd to 6th sources are distributed as follows: 

~ Af {0,1) (3.10) 
s 2 nj (t) ~ Af(c,l) (3.11) 

The data are randomly mixed orthogonally. On each realization of the data set, we compute directions 
wlda an w s lda and measure the angle between each w and the projection to the second source, P s ^p . 
We set t — 2 and use epochs i = 1, 7, each of which contains 11 data points per class. The exact 
parameter values we fix are as follows: r = 2, b = 1.2, c = 0.2. The results over 100 realizations of the 
data set for each value over varying trade-off parameter a and non-stationarity level k are displayed in 
terms of subspace angle in Figure |3.5| 

3.4.2.3 Investigation of transfer non-stationarity to test-phase 

Finally, we assess the relationship between the non-stationarity observed on the training data with the non- 
stationarity observed in the transfer between training and test data in terms of classification test error: we 
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aim to assess the discrepancy between these non-stationarities which guarantees, on average, an improvement 
in classification performance using sLDA over LDA. 

• s-space small: In the first of these transfer simulations, we retain the second data set used in the fourth 



set of simulations (Section 3.4.2.21, using the first 7 epochs as the training data each comprising 11 
data points per class. The 8th and final epoch generated is held back as a test set, comprising 150 
data points per class and the parameter a% is now fixed and varied between and 3. In addition, 
the parameter value k = 0.5 is fixed. We evaluate average classification performance over the range 
a 8 = 0.2, 0.4,..., 3.0. 

s-space large: In the second of these transfer simulations, we retain the same data set as in the previous 
simulation, with the only difference being that we enlarge the number of significantly separable but 
stationary directions to 2. That is, we exchange s no i sei for an independent copy of s sep and additionally 
increase b so that 6 = 2. In addition, the parameter r is reduced so that r = 1 and c is decreased so 
that c = 0. 

In both transfer simulations, we study gradLDA as a sanity check against regularization. 

3.4.3 Results and Discussion 

In the current section we describe the results for the simulations described in Section l3~4l A 
3.4.3.1 Results for Sanity Checks for Regularization 

The results for the simulations described in Section |3.4.2.1| are displayed in Figure |3.3| The comparison 



between gradLDA and LDA on the first data set (3.4.2.1 Simple) show that when there arc multiple dis- 



criminative directions in the data, gradLDA is outperformed by LDA; see the top left panel of the figure. 



Similarly, for the second data set (3.4.2.1 Outliers), see the top right panel, sLDA provides no robustness 



effect in the presence of outliers. Thus, gradLDA does not necessarily regularize or robustify LDA through, 



for instance, early stopping. However, the results from the third and fourth data sets (3.4.2.1 Hard and 
Tapered Difficulty) show that when the classification problem is more difficult, improvements in classification 
performance of up to 1% are possible using gradLDA. Thus, some regularization effects may be obtained, 
when using gradLDA, when the classification task is difficult. 

3.4.3.2 Subspace Based Comparison of sLDA and LDA 

Note that we evaluate the quality of the performance of sLDA first in terms of angles between subspaces, 
rather than in terms of classification accuracy. This is because classification accuracy is dependent on the 
difference in distribution between training and test sets: given set training data, we may choose the test 
data arbitrarily. We can produce test data which induces an arbitrarily low test performance corresponding 
to the LDA solution over the sLDA solution: for example, we can frame data sets whereby a slightly non- 
stationary subspace in training corresponds to a highly non-stationary subspace in testing. This is achieved 
by choosing a non-stationarity in testing which does not reflect the level of non-stationarity observed in 
training: see Figure |3.7| Thus, any evaluation scheme based on classification accuracy for a fixed distribution 
of non-stationarity is data dependent and lacks objectivity. 



In the first simulation, (3.4.2.2 Simple) sLDA finds the correct direction to within 10 to 20 degrees, 



whereas LDA often chooses the wrong direction. The imperfection in performance of sLDA is due to the 
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Figure 3.3: The figure displays the results from the simulations described in Section [3.4.2.1| The top left panel 
displays the results from the first simulation (Simple) and demonstrates that LDA outperforms gradLDA 
(sLDA + a = 1) for stationary data generated by sources which are uniformly separable. The top right 
panel displays the results from the second simulation (Outliers) and shows that gradLDA does not robustify 
LDA against outliers. The bottom left panel displays the results from the third simulation (Hard) and the 
bottom right displays the results from the fourth simulation (Tapered Difficulty) and show that when the 
discriminative subspace is smaller (the problem is harder) , then regularization effects may be obtained using 
gradLDA. All performances reported in given in terms of errorates on the y-axis. In the boxplots, the whiskers 
describe the full extent of the data, the box corresponds to the lower quartile, median and upper quartile. In 
the bottom left panel, the a;- values represent the level of separation on the single separable source, whereas 
in the second panel, the x-values represent the level of separation on the second and third separable sources. 



fact that the non-stationarity on the first dimension may not, in every case, be sufficient to outweigh the 
discriminability available in that direction. 

In the second simulation ( 3.4.2.2| Realistic), see Figure 3.5 for the data-set used, sLDA with low values 
of the hyperparameter a achieve lower angles with the discriminative yet stationary source in the data 
set than LDA. As the non-stationarity in training grows, however, LDA improves, since the high level of 
non-stationarity implies low discriminability. 

A further subtlety to note is that, if the parameters of the epoch distributions on the non-stationarity 
directions of data space are drawn themselves from a probability distribution, then the non-stationarity may 
be ignored provided enough epochs are available. For example, if one assumes that the data on each epoch is 
distributed according to a Gaussian distribution and that the non-stationarity consists of non-stationarity in 
mean, then the distribution over the data r(x; a) obtained by integrating out the distribution over parameters 
for the mean is also a Gaussian: 



SIMULATIONS 33 




Figure 3.4: The figure displays the accuracy in degrees between the stable and separable direction and the 
direction chosen by resp. LDA and sLDA with a = 0.1 for the 1st simulation (3.4.2.2 Simple) described in 
Section |3.4.2.2| The goal is to pick a single direction which is separable and stationary from a three dimen- 
sional space which contains non-separable, stationary directions and separable, non-stationary directions. 
The z-axis displays the non-stationarity level: this corresponds to a ns in the text above and corresponds 
roughly to the average deviation of the epoch means from the mean over all epochs. The y-axis corresponds 
to the accuracy of methods in choosing the correct direction measured in degrees. From the figure we can see 
that the higher the non-stationarity level, the more accurate sLDA is. For high non-stationarity levels, LDA 
is inaccurate. However when the non-stationarity level is very high, LDA becomes less inaccurate, because 
lower separation is induced on the non-stationary source, through this non-stationarity. 



p(x;6) = AT(x;e,4) (3.12) 

q(0;a,P) = Af(9;a;/3 2 ) (3.13) 

r(x;a,0) = | J p(x; 6)q(6; a., (3)d6 (3.14) 

= - / e 2a " , e ^f^dd (3.15) 
ZJ v /2^ ^1 

i r i i -itz^i 



=e 2 "« e ~^~dd (3.16) 

ZJ 

= N{x- a, a 2 + (3 2 ) (3.17) 

Thus in the case whereby non-stationarity can be construed as draws of parameters from a probability 
distribution, given enough data, no extra allowance for non-stationarity should be made. Thus the sLDA 
ansatz requires that such a formulation is invalid. In addition, for sLDA to aid classification, the topographies 
of the non-stationarities should be constant over time: i.e. the spatial location of the non-stationary sources 
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Figure 3.5: The figure displays the accuracy in radians between the stable and separable direction and the 
direction chosen by resp. LDA and sLDA for the 2nd simulation described in Section 3.4.2.2 (Realistic). 
The data set consists of six dimensions composed of a single stationary and separable source a separable 
and non-stationary source and four noise sources. The goal is to recover the single stationary and separable 
source. The x-axis displays the non-stationarity level: this corresponds to the average deviation of the epoch 
means from the mean over all epochs. The y axis corresponds to varying values of the trade-off parameter a 
in the sLDA loss function L a (w). The color displayed on the panels corresponds to the accuracy in radians 
in choosing the correct direction for each of the methods. For low non-stationarity levels, sLDA only chooses 
the correct direction for smaller values of a but for higher non-stationarity levels, sLDA at higher values of a 
chooses the correct direction. Similarly to the results displayed in Figure |3T] LDA is more accurate at higher 
non-stationarity levels due to the lower separation, on average, induced on the non-stationary directions in 
the data setup. 



affecting classification should be the same in the test and in the training data. These requirements, taken 
together, constitute very strong assumptions on the data generating process. This requirement is investigated 
in the classification simulations. 

3.4.3.3 Results of Transfer Non-Stationarity to Test-Phase Simulations 

The simulations in Section [3.4.2.3| are aimed at achieving a partial intuition of the subtleties discussed in Sec- 
tion 3.4.3.2 In particular, at a fixed parameter value for the trade-off parameter a — 0.1 which achieves high 
subspace accuracy in the subspace simulations, we observe in Figure [3~6| that this does not necessarily entail 
improvement of sLDA over LDA in the classification task, even under arbitrary non-stationarity. Rather, 



it is shown that, if the stationary but discriminative directions (3.4.2.3 s-space small) are not significantly 
more separable than the non-stationary but discriminative direction, then improvement is not possible. On 
the other hand, if there are a number of stationary directions which are discriminative and there is one 



non-stationary (3.4.2.3 s-space large) but moderately discriminative direction, then improvement is possible 
using sLDA over LDA. In addition, it seems that gradLDA fares even worse than LDA in this case: this 
suggests that gradLDA loses classification accuracy through poor optimization. This suggests, in addition, 
that further improvement, in this case, is plausible, given a tidier optimization approach for sLDA. 

An example in which sLD A yields improvement over LDA is displayed in Figure |3.7| Whilst the direction 
chosen by LDA on the training data displays clear separation, the opposite is true for the LDA test data. On 
the other hand, while the separation on the direction chosen by sLDA on the training data is not so clear, 
the distributions remain more stable in the transition to the test data. 
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Figure 3.6: The figure displays the classification accuracy for resp. LDA and sLDA whilst varying the transfer 
non-stationarity from training to test. For both panels, the data set consists of six sources mixed orthogonally. 
One of these sources is non-stationary and separable, several are noise directions and resp. 1 or 2 of these 
sources are separable and stationary. See Section |3.4.2.3| for details of the data setup. For each panel, the x- 
axis displays the non-stationarity level: this corresponds to the average deviation of the epoch means from the 
mean over all epochs, whilst the y-axis corresponds to the classification error achieved. The left hand figure 



displays classification accuracy when the discriminative subspace has dimension approximately 2 ( 3.4.2.3 



s-space small) whilst in the right hand figure, the discriminative subspace has dimension of approximately 
3 ( 3.4.2.3 s-space large) and the n-source provides less discriminative information than in the simulation 
corresponding to the left hand panel. 



3.5 Experiments 

3.5.1 Description of Experiments 

We evaluate the performance of the above method, sLDA, on BCI data combined with preprocessing with 
CSP making a reduction to a fixed number, hcsp, of features before classification. The data consist of 80 BCI 
subjects taken from the study [7]. The baseline procedure for classification using CSP with LDA, which we 
use, follows the standard template for processing EEG for motor imagery classification [5]. See Algorithm [I] 

Algorithm 1 CSP+LDA 
1: Find the discriminative frequency band. 
2: Compute a number, hcsp, of feature directions. 

3: For each trial compute the variance giving 2m data points, one for each trial, with m in each class. 
4: Train the LDA classifier on the entire training data consisting of these 2m data points. 
5: Test the classifier's performance on the test set. 

For sLDA the procedure is defined as per Algorithm [2] 

In addition we test selection of a via cross validation as in Algorithm [3] 

Although we argued above that by the nature of the problem, SSA is an unsuitable method for classifi- 
cation, we include results for SSA for completeness and by way of a check. In particular we preprocess with 
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Figure 3.7: The figure displays an example in which the y-direction is non-stationary and the x-direction 
is stationary; although the y-direction has higher separation in training, the x-direction is to some extent 
separable and is more stationary. In this case, the black decision line may yield a lower error in testing than 
the LDA solution in blue. 



Algorithm 2 CSP+sLDA with a = a 
1: Find the discriminative frequency band. 
2: Compute a number, hcsPi of feature directions. 

3: For each trial compute the variance giving 2m data points, one for each trial, with m in each class. 
4: Train the sLDA with a = a classifier on the entire training data consisting of these 2m data points. 
5: Test the classifier's performance on the test set. 



SSA as per Algorithm [4] 

By way of an extra sanity check, we test against rLDA [6 : rLDA works by regularizing the covariance 
estimates against extreme eigenvalues which occur when estimating covariance from small sample sizes. By 
testing the difference in performance between the stationarity-penalizing method sLDA and the regularizing 
method rLDA, we aim to check that any increases in performance are not (completely) due to implicit 
regularization (for small sample size) of the covariance estimates. The steps are displayed in Algorithm [5] 

As yet another sanity check, we test the following method: we perform LDA via gradient based opti- 
mization: that is to say sLDA with the trade off parameter set to unity: a = 1. We will often refer to this 
approach as "gradLDA". The steps are displayed in Algorithm [6] 

The rational for this is to test whether any improvements in performance are due to, for instance, implicit 
regularization at local minima or in early stopping of gradient based optimization [58 . 



Algorithm 3 CSP+sLDA c.v. 
1: Find the discriminative frequency band. 
2: Compute a number, hcsPi of feature directions using CSP. 

3: For each trial compute the variance giving 2m data points, one for each trial, with m in each class. 

4: Perform /c-fold classification on the training data for trade-off parameter settings ax,..., ccr, on each fold 

evaluating the performance using sLDA at the chosen parameter setting a r . 
5: Retrain the sLDA classifier on the entire training data consisting of these 2m data points using the best 

parameter setting a. 
6: Test the classifier's performance on the test set. 
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Algorithm 4 SSA + CSP+LDA 
1: Find the discriminative frequency band. 
2: for d s = D - 30, ...,D- 1 do 

3: Perform a fc-cross validation fold by reducing the number of dimensions to d s with SSA: 
perform the CSP+LDA (Algorithm [l]) on P S X where X is the input data and P s is the 
projection to the stationary sources found using SSA. 

4: end for 

5: Chosen the optimal d s as the d s which minimizes the cross validation error. 

6: Retrain the standard CSP+LDA method (Algorithm [I]) on P S X for the projection, P s , to the stationary 
sources found with SSA at d s . at this optimal parameter setting. 



Algorithm 5 CSP+rLDA 
1: Find the discriminative frequency band. 
2: Compute a number, hcsPi of feature directions using CSP. 

3: For each trial compute the variance giving 2m data points, one for each trial, with m in each class. 

4: Train the LDA classifier on the entire training data consisting of these 2m data points using regularized 

estimates of the covariance matrix at the optimal shrinkage parameter. 
5: Test the classifier's performance on the test set. 



As a final sanity check we test the performance of the following method: instead of performing sLDA and 
penalizing non-stationarity, we test the performance of a trade-off between the Fisher error function and a 
random homogeneous polynomial of degree 2 in the coefficients of the putatively discriminative direction w. 

That is, we use the error function: 

y w 1 (Ei + Y, 2 )w 

Here, R is a random square matrix with coefficients drawn uniformly from [0, 1]. So the method runs as 
per Algorithm [7] 

The rationale for testing randLDA, is that it may be due to a data dependent property that improvement 
using sLDA is possible: a trade off with a random error function may also achieve performance increases. 
We therefore include a random trade off for comparison's sake. 

In summary we test the performance of the following basic methods: 

1. CSP + LDA (Algorithm ^ 

2. CSP + rLDA (Algorithm [5} 

3. CSP + sLDA c.v. (Algorithm |3| 

4. CSP + sLDA with a = a (Algorithm [2) 

5. SSA + CSP + LDA (Algorithm [§ 

Algorithm 6 CSP+gradLDA 
1: Find the discriminative frequency band. 
2: Compute a number, hcsp, of feature directions using CSP. 

3: For each trial compute the variance giving 2m data points, one for each trial, with m in each class. 
4: Train the sLDA classifier on the entire training data consisting of these 2m data points using a = 1 
5: Test the classifier's performance on the test set. 
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Algorithm 7 CSP+randLDA with a = a 
l: Compute a random matrix i?. 
2: Find the discriminative frequency band. . 
3: Compute a number, hcsp, of feature directions using CSP. . 

4: For each trial compute the variance giving 2m data points, one for each trial, with m in each class. 

5: Train the randLDA classifier with a = aon the entire training data consisting of these 2m data points. 

6: Test the classifier's performance on the test set. 



6. CSP + gradLDA (Algorithm |6) 

7. CSP + randLDA (Algorithm [7) with a — a. 

We test the classification performance of all of these methods on the 80 subjects available and report the 
results below. In particular hcsp = 6. The number of epochs chosen for evaluating stationarity in sLDA is 
7; this guarantees the determinacy of any 1 dimensional projection for hcsp < 12 (for details see [S3]). We 
choose k = 5, for cross validation for sLDA and for SSA. The training data in each case consists of 75 trials 
from each class and the test data consists of 150 trials from each class. 



3.6 Results and Discussion of Experiments 

3.6.1 Layout of Figures 

Figure |3.8| displays the individual differences in subject performances between a selection of the methods 
tested for the setting hcsp = 6. Figure [3 . 11 1 displays the average performance of each of the methods at a 
selection of parameter settings. Figure [3 . 1 1 displays the quantiles of the improvement in performance for the 



comparisons made individually in Figure 3.8 Figure 3.12 displays an example (subject no. 74) where sLDA 



induces a large improvement over LDA. Figure [3 . 1 3 1 displays the corresponding scalp plots for subject no. 74 



and Figure 3.14 displays the scalp plots of the sLDA and LDA solutions for the subject (no. 71) for which 



sLDA performed worst. The corresponding p — values for: 

"H = {CSP+yLDA has an error-rate < to the error-rate of CSP+LDA} for y = s,r" 



are displayed using student's paired one sided t-test in Table [3^2] and using the Wilcoxon sign rank test for 
equality of medians in Table |3.1| 



3.6.2 Discussion of Results 

The results displayed in Figure |3.8| show that application of SSA leads to no significant improvement in 
performance against the baseline method (CSP+LDA with hcsp = 6), despite choice of the parameter 
d s (no. of stationary sources) by cross validation. Although SSA improves the performance on a subset 
of individual trials, preprocessing leads to no significant increase in overall performance of classification. 
Therefore we concentrate on comparison with the baseline method CSP+LDA henceforth. 

Firstly, we note that both sLDA with cross validation, sLDA with a = 0.5, sLDA with a = 0.4, gradLDA 
and randLDA with a — 0.6 lead to significant improvements (p <C 0.01) in performance over the baseline 
method LDA and over rLDA; see Tables 13. II and 13.21 
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Algorithm 1 


Algorithm 2 


p- value 


c.v. sLDA 


LDA 


p = 7.60 x 10~ 6 


rLDA 


LDA 


p= 1.96 x 10~ 4 


c.v. sLDA 


rLDA 


p = 0.0153 


c.v. sLDA 


gradLDA 


p=l 


sLDA a = 0.4 


LDA 


p = 1.03 x 10" 5 


sLDA a = 0.5 


LDA 


p= 1.76 x l(T e 


gradLDA 


LDA 


p = 2.05 x 10~ 5 


sLDA a = 0.5 


gradLDA 


p=\ 


randLDA a = 0.6 


LDA 


p = 4.48 x 10" 4 


SSA 


LDA 


p=l 



Table 3.1: p- values for the 80 BCI motor imagery data sets. The p- values given are obtained using a one 
sided paired Wilcoxon sign rank test using a Bonferroni correction factor n = 9 equal to the number of 
tests performed. Each test tests the null hypothesis, effectively that the algorithm in the left hand column 
performs better than the algorithm in the middle column. Each row of the table corresponds to a single 
comparison between methods. The corresponding p-values are displayed in the right hand column. 



Algorithm 1 


Algorithm 2 


p- value 


c.v. sLDA 


LDA 


p = 5.04 x 10" 5 


rLDA 


LDA 


p = 0.0045 


c.v. sLDA 


rLDA 


p = 0.0405 


c.v. sLDA 


gradLDA 


p=l 


sLDA a = 0.4 


LDA 


p = 3.73 x 10~ b 


sLDA a = 0.5 


LDA 


p = 1.94 x 10~ 5 


sLDA a = 0.5 


gradLDA 


p=l 


gradLDA 


LDA 


p = 3.611 x 10" 5 


randLDA a = 0.6 


LDA 


p = 0.0684 


SSA 


LDA 


p=l 



Table 3.2: One sided paired t-tests using a Bonferroni correction factor n = 9 (for the 80 BCI classification 
errors) equal to the number of tests performed. Each test tests the null hypothesis, effectively that the 
algorithm in the left hand column performs better than the algorithm in the middle column. Each row 
corresponds to a comparison of methods. The corresponding p- values are displayed in the right hand column. 
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Thus, we may conclude for these methods, that significant improvement is observed, not due to effects 
equivalent to covariance estimate regularization. 

The method yielding the highest performance is sLDA with a — 0.5. A few datasets are observed where 
gradLDA and randLDA show a decrease in performance but sLDA does not. However, the improvement of 



sLDA with a — 0.5 over, for instance, gradLDA is not significant (see Table 3.2 and Table 3.1|. 

Whether sLDA's incorporation of non-stationarity considerations, in fact, contributes to the observed 
improvement is also doubtful in light of the data analysis in Section |3.6.3| here we observe that the non- 
stationary directions between training and test and the non-stationary direction within training reflect each 
other poorly. Suppose, therefore, that sLDA's effectiveness, here, does not rest on its stationarity incorpora- 
tion: we must suggest alternative explanations for the improvement observed. 

The first possibility, on which improvement may be based, is regularization. If the data is correctly 
modeled as Gaussians, then covariance regularization using the optimal shrinkage parameter (i.e. rLDA) 
should be optimal, since LDA is optimal for Gaussians. So the claim that improvement is solely due to 
regularization is equivalent to claiming that the Gaussian assumption is false. That regularization is the sole 
explanation for improvement may be argued as implausible, in addition, when one takes the simulations from 
Section [3.4.2. 1| into account. In particular, when there is more than one discriminative direction in the data 
set then regularization effects are lower. This is often the case for the data sets at hand, since the feature 
extraction step, CSP, extracts a number of discriminative directions. Additionally, regularization effects are 
never observed to exceed 1% in simulation, whereas the performance increases observed, far exceed 1% (see 



Figure 3.3 for detailed results of our simulated investigation of regularization). 

Thus, suppose that regularization does not fully explain improvement using sLDA, randLDA and gradLDA. 
Are there other possible explanations? One candidate explanation is that improvement depends on the BCI 
data specific non-stationarities between the training and test data. In particular we will observe in the data 
analysis of Section |3.6.3| that there are highly non-random dependencies between the most non-stationary 
directions and the most discriminative. 

According to this possibility, there is a data dependent feature which implies that slight deviation from 
the direction chosen using LDA results in an increase in performance. The increase in performance over LDA 
using gradLDA and randLDA using a — 0.6 may be attributed to the slight difference in direction chosen 
using gradient based optimization. The proposed explanation we suggest is that sLDA with gradient based 
optimization returns a slightly different direction for classification than that produced by LDA, either due 
to local mimima or due to the stochastic nature of the gradient based approach. This slight difference in 
direction from the LDA solution accounts for the increase in performance due to BCI data dependency. 

Thus, we conclude the following: 

1. sLDA produces a significant increase in performance over the baseline method. 

2. There is possibly a data dependent feature under which performance increases are possible using algo- 
rithms which perturb LDA randomly. 

3. sLDA outperforms sLDA with a — 1 but not significantly, (p — 0.26 paired one sided i-test; p = 0.45 
one sided paired Wilcoxon sign-rank without correction for multiple testing — see Tables |3.2| and |3.1| 
for Bonfcrroni corrected p- values.) 
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3.6.3 Analysis of the 80 BCI datasets using group wise SSA 

In this section, we aim to gain insight into the non-stationarity present in BCI datasets. The aim is to 
understand the improvement in performance observed above despite using putatively suboptimal classification 
methods for the task (for example randLDA and sLDA with a = 1). If non-stationarity is correlated with 
the direction of highest discriminability on the training data, then it may be the case that choosing a slightly 
different but nonetheless discriminative direction than chosen by LDA, results in a more stationary direction 
between training and test data. 

In the first step of the analysis we compute the most non-stationary direction between the training and 
test data and the most non-stationary direction within the training data using group wise SSA |48| . The aim 
here is to gain insight into how consistent any attempt to infer the stationarity of directions between training 
and test data using only the training data can be. The method used is as follows: 

1. The data are fixed as the 6 CSP features from the experiments above. 

2. The most non-stationary direction are computed on the training data using 7 epochs. 

3. The most non-stationary direction between training and test data are computed using 2 epochs, one 
for training and one for test. 

4. The angle is computed between the two directions. 

In the next step of the analysis, we compute the angles between the most non-stationary direction between 
training and test and the direction chosen using LDA on the training data, as follows. 

1. The data are fixed as the 6 CSP features from the experiments above. 

2. The LDA direction is computed within these features. 

3. The most non-stationary direction between training and test data is computed within these features 
using 2 epochs group wise SSA, one for training and one for test. 

4. The angle is computed between the two directions. 

The results are displayed in Figure |3.15| Firstly, the results show that the attempt to infer distribu- 
tions over non-stationarity is difficult at best purely on the basis of the training data. The results show 
further that we may reject the hypothesis "Ho = The angle between the two directions is random at a highly 
significant confidence level using the Kolgomorov-Smirnov test (p <C 0.0001) [2] for testing whether the 
sample angles between the directions measured are drawn from the distribution over angles between two 
one-dimensional subspaces defined by vectors drawn from a 6 dimensional multivariate uniform distribution: 
i.e. the distribution defined on [0, |] by the density function p{6) = ^sm A {9) pQ. 

3.6.4 Discussion of the Analysis 

The analysis above shows that the most non-stationary direction between the training and test data for BCI 
datasets often corresponds closely to the direction chosen by LDA. This may explain why gradLDA improves 
performance in the BCI classification task: if we randomly perturb the direction given by LDA slightly, 
then we may obtain a more stationary direction but nevertheless a discriminative direction. That is to say, 
gradLDA slightly perturbs the LDA solution in a random direction. 
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Moreover, a further potential explanation for gradLDA's improvement is the following: if the discrim- 
inative subspace of the data space which is stationary has higher dimensionality than the non-stationary 
subspace and we perturb a direction within this discriminative subspace, then the odds are higher that we 
move further away from the non-stationary subspace than from the stationary subspace. 

A further, neuroscientific, question is: what constitutes, from a neurological point of view, the correlation 
between the most non-stationary direction and the most discriminative direction? A potential answer is that 
the test phase requires the engagement of a slightly different set of cognitive abilities than the training phase: 
for instance, during training, the subject sees the cue to perform a particular motor imagery: see [7] for 
details of the experimental setup. Visual areas may thus provide information in training not available in 
testing and thus a difference in the most discriminative direction which presents a mode of non-stationarity 
which is not susceptible to extraction from the training data alone. This implies that there may often be 
a non-stationary direction which coincides largely with the most discriminative direction on the training 
data. If this direction is simultaneously the most non-stationary direction, then any deviation away from 
the direction chosen by LDA is highly probably more stationary. Although it may appear that this data 
pecularity is merely an artifact of the experimental setup, the phenomenon, whereby different states of the 
subject, including discriminative information, occur in the training and test phases, is pertinent to the design 
of any Brain Computer Interface. The task, therefore, remains, to examine the structure of this systematic 
difference for selected paradigms. 

3.7 Outlook 

We have demonstrated that sLDA leads to significant improvements in performance for BCI classification over 
the standard approach which cannot be explained as the result of implicit eigenvalue-based regularization. 

As we stated above, the aim in defining sLDA was to derive a linear classifier which is robust against 
non-stationarity under the assumption that the classes may be coarsely modeled as Gaussians. We noted that 
specific assumptions must be fulfilled for sLDA to work successfully: non-stationarity in the training set must 
be constant in topography between the test and training phases, but itself non-stationary in distribution over 
time. However when fulfilled, we observed that sLDA leads to improvement in classification performance. 

However, we have observed, in addition, that the improvement for BCI may be data dependent. In 
particular, we successfully showed that there are highly non-random patterns in the BCI non-stationarity. 
The challenge remains, therefore, to describe explain exactly what type of non-stationarity must be present 
to make such improvement possible and whether this understanding may be levered to design methods which 
yield higher improvements in the face of non-stationarity. The exciting task of understanding domain specific 
BCI non-stationarity is also an inviting road for further research. 
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Figure 3.8: The figure displays results given as differences in error-rates in percent on individual subjects 
between the methods, LDA, sLDA with a = 0.5, 0.4 and rLDA, gradLDA, randLDA and sLDA with cross 
validation with CSP processing on the 80 BCI motor-imagery subjects. In each case the comparison is 
displayed in the title to the corresponding panel. The a;-axis corresponds to individual BCI motor imagery 
subjects and the y-axis corresponds to the difference in performance between the methods compared. The 
results show that sLDA yields an improvement over LDA which cannot be explained in terms of implicit 
eigenvalue spectrum related regularization but may be explained in terms of a different form of regularization 
or otherwise, due to improvement observed for randLDA. sLDA with a = 0.5 has the highest mean accuracy 
due to the presence of fewer lower tail outliers. 
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Figure 3.9: The figure displays results given as error-rates scatter plotted in percent between the methods, 
LDA, sLDA with a = 0.5 and rLDA, gradLDA, randLDA and sLDA with cross validation, with CSP 
processing on the 80 BCI motor imagery subjects. A dot below the diagonal represents an increase in 
performance (for eg. sLDA over LDA). The comparison made is displayed in the title and each of the 
methods classification errors are displayed on the x and y-axes. Each dot corresponds to a single BCI 
subject. The plots show that that the improvement of sLDA over LDA is most marked when the error rate 
of the subject under LDA is high, although improvements in performance are observed at lower error-rates 
also. 
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Improvements: quantiles; r^gp = 6 



sLDA/LDA rLDA/LDA sLDA/rLDA \alpha=0.5 \alpha=0.4 gradLDA/LDA randLDA/LDA SSA+LDA/LDA 



Figure 3.10: The figure displays the quantiles (the whiskers describe the full extent of the data, the box 
corresponds to the lower quartile, median and upper quartile; the crosses correspond to outliers) over im- 
provement on the 80 BCI motor- imagery subjects in classification accuracy for the methods LDA, sLDA 
with a = 0.5 and rLDA, gradLDA, randLDA and sLDA with cross validation, in percentage terms. The 
comparison drawn is displayed on the cc-axis. The j/-values display the corresponding increases in accuracy. 
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Figure 3.11: The upper panel figure displays the mean average classification accuracy on the 80 BCI motor 
imagery data sets in percentage terms for LDA, sLDA with a — 0.5 and rLDA, gradLDA, randLDA and 
sLDA with cross validation. The a;- values represent the individual methods and the y- values the corresponding 
accuracies. The lower panel displays the corresponding medians. Whilst sLDA with a — 0.4 obtains the 
highest mean error, randLDA obtains the highest median error. (More trials for randLDA are lower tail 
outliers.) 
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Figure 3.12: The top two panels display the difference modulo the LDA hyperplane between training and 
testing for a single BCI motor imagery subject (subject no. 74). The bottom two panels display the difference 
modulo the sLDA hyperplane between training and testing for the same subject. The blue lines display the 
decision boundary for each classifier. For clarity we plot the principal component of the pooled data against 
the normal vector to each respective hyperplane. Whilst the class distributions shift significantly towards 
the hyperplane for the LDA solution, the shift is less pronounced for sLDA. 
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Figure 3.13: The figure displays scalp plots for the LDA and sLDA solutions for the subject whose im- 
provement was greatest using sLDA (subject no. 74). In upper panel, the field pattern for LDA is displayed, 
whereas in the bottom panel, the difference for LDA is displayed. The difference in topography is most 
striking at the right central to parietal electrode positions. 
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sLDA solution 




Figure 3.14: The figure displays scalp plots for the LDA and sLDA solutions for the subject whose improve- 
ment was most inferior using sLDA (subject no. 71). In upper panel, the field pattern for LDA is displayed, 
whereas in the bottom panel, the difference for LDA is displayed. The difference in topography is most 
striking at the right central to frontal electrode positions. 
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Angles between non-stationary directions and LDA 
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Figure 3.15: The figure displays the angles as a boxplot (the whiskers describe the full extent of the data, 
the box corresponds to the lower quartile, median and upper quartile) for the 80 BCI motor imagery subjects 
between a selection of 1 dimensional subspaces contained within the CSP features. The left hand column 
represents the angle chosen by LDA on the training data and the direction maximizing the non-stationarity 
between training and test data. The second column displays the quantiles over the angles between the most 
non-stationary direction within the training data and the most non-stationary direction between training and 
test data. The third column displays the results one would expect if two subspaces are chosen at random, 
for comparison's sake. The results show that, there is little correlation between the non-stationarities within 
training and between training and test but that there is correlation between the the subspace chosen by LDA 
and the direction maximizing training to test non-stationarity. This leaves the particular suitability of sLDA 
for the BCI data sets doubtful. 



Chapter 4 

Conclusion 



This thesis has contributed to research into non-stationarity in Machine Learning by proposing and testing 
two linear projection algorithms for dealing with non-stationarity. Both are inspired by Stationary Subspace 
Analysis. The first algorithm maximizes non-stationarity under its projection of the dataset and is thus suited 
as a preprocessing method for high-dimensional sequential change-point detection, although its application 
is not in principle limited as such: we conjecture, additionally, that maximizing non-stationarity will prove 
of interest in the analysis of neural data resulting from learning paradigm experimental settings. The second 
algorithm proposed yields improvements under non-stationarity in Brain Computer Interfaces. Although we 
cannot conclude that the improvement in performance observed is a results of non-stationarity penalization, 
we have identified, using SSA and the method developed, sLDA, features of BCI datasets which may represent 
valuable prior information in coping with non-stationarity. 
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